# Multimodal interaction

Llama 4 Scout 17B 16E Instruct INT4

Llama 4 is a family of natively multimodal AI models released by Meta. It uses a Mixture of Experts architecture, supports text and image interaction, and performs well across a range of language and vision tasks.

Multimodal Fusion

Transformers Supports Multiple Languages

Llama 4 Scout 17B 16E Instruct FP8

Llama 4 is a family of natively multimodal AI models released by Meta that supports text and image interaction. It uses a Mixture of Experts architecture and performs well on text and image understanding.

Multimodal Fusion

Transformers Supports Multiple Languages

Qwen.qwen2 VL 2B GGUF

Qwen2-VL-2B is a multimodal model that takes image and text inputs and generates text outputs.

Videochatonline 4B

VideoChat-Online is an online video understanding model based on Phi-3-vision-128k-instruct, focused on video-text-to-text tasks.

UGround is a GUI visual grounding model trained with a simple recipe, developed jointly by the OSU NLP Group and Orby AI.

Transformers English

PAE-LLaVa-7B is a foundation-model web agent built on the PAE (Proposer-Agent-Evaluator) framework, focused on autonomous skill discovery.

An Any-to-Any subnet model developed jointly by OMEGA Labs and Bittensor that supports conversion between multiple task types.

Large Language Model Other

Mini-Omni2 is an interactive multimodal model that understands image, audio, and text inputs and holds end-to-end voice conversations with users.

Multimodal Fusion

Mixtral AI Vision 128k 7b

A multimodal model that combines vision and language capabilities, enabling image-text interaction through model merging.

Transformers English

© 2025 AIbase